Exploring Equity Classifications with Machine Learning

Proposal

DATE: May 20, 2021

TO: Annette Demchur, Rebecca Morgan

FROM: Margaret Atkinson

RE: Staff Initiated Study Proposal:

Exploring Equity Classifications with Machine Learning

I would like to conduct research to answer the following question: Could an unsupervised machine learning algorithm create groups of towns based on demographic information that would be useful for exploring questions about equity? To explain: when we compare all towns that pass the minority threshold (or TAZs, block groups, etc.) to all towns that do not, we may miss the way the demographic variables interact. A multifactored grouping could let us explore town demographics with a more detailed approach, without the blending nature of an index.

This project would use Python, specifically the scikit-learn library, to conduct unsupervised machine learning based on demographic data at the town level. The data would be demographic Census data from the American Community Survey 5-year estimates, at minimum on the topics of: Race/Ethnicity, Limited English Proficiency, Median Income, Low Income, No Car Households, Population Density, Children, and Seniors. The product would be a geographic file that shows groupings of towns by demographic profile as found by the unsupervised machine learning algorithm, as well as a written description of what each grouping represents.
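A minimal sketch of the proposed pipeline, assuming scikit-learn's KMeans and a handful of hypothetical town-level variables (the column names and values below are placeholders, not the real ACS data):

```python
# Sketch of the proposed workflow: scale town-level demographic
# variables and cluster them with scikit-learn. The column names and
# values here are hypothetical placeholders, not the real ACS data.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

towns = pd.DataFrame(
    {
        "pct_minority": [0.12, 0.45, 0.08, 0.60, 0.22, 0.35],
        "median_income": [85000, 42000, 91000, 38000, 67000, 55000],
        "pct_no_car": [0.05, 0.22, 0.03, 0.30, 0.10, 0.15],
    },
    index=["Town A", "Town B", "Town C", "Town D", "Town E", "Town F"],
)

# Standardize so income (tens of thousands) doesn't dominate shares (0-1)
X = StandardScaler().fit_transform(towns)

# Group towns into demographic profiles
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
towns["cluster"] = labels
print(towns["cluster"].value_counts())
```

The cluster labels would then be joined back to a geographic file of town boundaries for mapping.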

If the question is pursued and the results are useful, the ultimate intention (as a follow-up project) would be to look at the way the MPO distributes funding between the groups, and within each group, to look for disparities. Explanations of disparities could lead to a re-examination of the variables used in the algorithm in order to provide an additional check for equitable spending.

Results Summary:

After trying two different clustering methods (K-Means and DBSCAN) on the original data and on data representing the relationships between the original variables (principal components), I determined that the results of this study are not significant: the demographic variables chosen do not produce multiple clusters of high enough quality to consider using. Confidence in the clusters created is low - in truth, there appears to be one cluster plus some outliers that relate only minimally to each other. To determine whether this was an issue with the number of samples (towns), I applied the same methodology to the Census Tracts of the MPO. A similar result was achieved even with the greater variation and quantity that the tracts provide.

While this idea has not panned out in a useful way, I propose that the agency continue to think outside the box in how we conduct equity analyses, especially given the advanced open-source tools we could be using.

Demographic Data Used

Each variable below lists the source table(s), the field calculation, and notes.

Race/Ethnicity
    Tables: SF1 2010: P5
    Calculation: P005001 - P005003 (Total - White Alone)
    Notes: Use Census per direction from Steven & Betsy

Limited English Proficiency
    Tables: ACS14: B16001
    Calculation: B16001_005 + B16001_008 + B16001_011 + B16001_014 + B16001_017 + B16001_020 + B16001_023 + B16001_026 + B16001_029 + B16001_032 + B16001_035 + B16001_038 + B16001_041 + B16001_044 + B16001_047 + B16001_050 + B16001_053 + B16001_056 + B16001_059 + B16001_062 + B16001_065 + B16001_068 + B16001_071 + B16001_074 + B16001_077 + B16001_080 + B16001_083 + B16001_086 + B16001_089 + B16001_092 + B16001_095 + B16001_098 + B16001_101 + B16001_104 + B16001_107 + B16001_110 + B16001_113 + B16001_116 + B16001_119
    Notes: C16001 ("less than very well") / (Total Population - (B01001_003 + B01001_027))

Median Income
    Tables: ACS14: B19013
    Calculation: B19013_001

% of HH with Income Below 200% of Poverty Line
    Tables: ACS14: C17002
    Calculation: C17002_002E + C17002_003E + C17002_004E + C17002_005E + C17002_006E + C17002_007E

Low Income Households
    Tables: ACS14: B19001, B19025, B11001
    Calculation: all of B19001, B19025_001, B11001_001
    Notes: HH income ranges, aggregate HH income, total HH

No Car Households
    Tables: ACS14: B08201
    Calculation: B08201_002
    Notes: HH with no vehicles available

Population Density
    Tables: https://jtleider.github.io/censusdata/api.html
    Calculation: B01001_001 / AREA (Total Pop / Area)
    Notes: Areas come from the shape data; also uses total population data

Children
    Tables: ACS14: B01001; 2010 Census: P12
    Calculation: (B01001_003 + B01001_004 + B01001_005 + B01001_006 + B01001_027 + B01001_028 + B01001_029 + B01001_030), (P012_003 + P012_004 + P012_005 + P012_006 + P012_027 + P012_028 + P012_029 + P012_030)
    Notes: Boys under 18 plus girls under 18 (ages 0-17)

Population Over 5
    Tables: ACS14: B01001; 2010 Census: P12
    Calculation: B01001_001 - (B01001_003 + B01001_027), P012_001 - (P012_003 + P012_027)
    Notes: Total population minus children under 5 (ages 5+)

Seniors
    Tables: ACS14: B01001
    Calculation: (B01001_020 + B01001_021 + B01001_022 + B01001_023 + B01001_024 + B01001_025) + (B01001_044 + B01001_045 + B01001_046 + B01001_047 + B01001_048 + B01001_049)
    Notes: Men ages 65+ plus women ages 65+

People with Disabilities
    Tables: ACS14: S1810
    Calculation: S1810_C02_001E / S1810_C01_001E (total population with a disability / total noninstitutionalized population)
    Notes: Includes ambulatory, hearing, vision, self-care, cognitive, and independent living difficulties

Total Population
    Tables: ACS14: B01001; 2010 Census: P1
    Calculation: B01001_001, P01_001
    Notes: Includes those housed in group quarters

Sources:

These are the sources I used to frame my methodology: assessing the strengths and weaknesses of clustering approaches, how clustering algorithms should be calibrated, and how to evaluate the results of the clustering.

Set Up

Finding Tables and Fields Resources

These links are documentation for the tables and values I use in the next section:

https://www.census.gov/prod/cen2010/doc/sf1.pdf

https://www.census.gov/programs-surveys/acs/technical-documentation/table-shells.2014.html

These are notes on which fields within which tables I am using.

Grab ACS Data, Brief Clean, and Sum if Necessary
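As a sketch of this cleaning step (the values below are made up; the column names follow the ACS field pattern), component fields are summed into a single analysis variable and converted to a share:

```python
# Sketch of the "brief clean and sum" step: ACS tables arrive as many
# component fields that must be summed into one analysis variable.
# The values below are made up; column names follow the ACS pattern.
import pandas as pd

acs = pd.DataFrame(
    {
        "C17002_002E": [10, 5],
        "C17002_003E": [8, 2],
        "C17002_004E": [6, 1],
        "C17002_005E": [4, 3],
        "C17002_006E": [2, 0],
        "C17002_007E": [1, 4],
        "B01001_001E": [310, 150],  # total population
    },
    index=["Town A", "Town B"],
)

# Sum the component fields into a single low-income count,
# then convert to a share of total population
low_income_cols = [f"C17002_{i:03d}E" for i in range(2, 8)]
acs["low_income"] = acs[low_income_cols].sum(axis=1)
acs["pct_low_income"] = acs["low_income"] / acs["B01001_001E"]
print(acs[["low_income", "pct_low_income"]])
```

The same pattern applies to the other multi-field variables (Limited English Proficiency, Children, Seniors).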

Machine Learning Section

Charts of Variables

Make graphs of all the variables to see what the ranges and relationships look like
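One way to produce such charts (a sketch using synthetic data in place of the town table; the column names are hypothetical):

```python
# Sketch: chart every variable's range and pairwise relationships
# before clustering. Synthetic data stands in for the town-level table.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["pct_minority", "median_income", "pct_no_car"],
)

# Histograms on the diagonal, scatter plots off it: one grid shows
# every distribution and every pairwise relationship at once
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")
print(axes.shape)
```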

So you don't have to scroll if you don't want to:

[Figure: newplot.png]

Machine Learning Algorithm 1: K-Means

Description from https://scikit-learn.org/stable/modules/clustering.html#k-means:

"The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields.

The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μ_j of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space.

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

    Σ_{i=0}^{n} min_{μ_j ∈ C} (‖x_i − μ_j‖²)

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks."

Drawbacks include that the algorithm is not deterministic: it uses random seed points each run, which may not actually serve the algorithm well - for example, all the seed points could start close together. Having different seed points each time means you get a different result every time you run the algorithm, even with the same parameters and input order. Additionally, the algorithm assigns every point to a cluster, meaning noise gets classified into clusters as well.

Optimizing the Number of Clusters 1: Silhouette Analysis for K-Means

"Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually.

Silhouette coefficients (as these values are referred to as) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster." from https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py

Given the importance of the number of clusters to the K-means algorithm, I tested multiple ways of determining the ideal number of clusters for the input data. Silhouette analysis is just one way to determine this. In this case, I iteratively run a silhouette analysis (and visualize it) over a range of cluster counts.
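The iterative sweep can be sketched as follows (synthetic blob data stands in for the scaled town demographics):

```python
# Sketch: run silhouette analysis over a range of candidate k values
# to pick the number of clusters for K-means. Synthetic blob data
# stands in for the scaled town demographics.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient: close to +1 is well separated,
    # near 0 is on a boundary, negative suggests wrong assignment
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```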

Silhouette analysis is also used further below to test the fit of the DBSCAN model's clusters to the data. Silhouette analysis can be used with multiple algorithms.

Something to note here: we may still get one good cluster out of this even if most of the points are not clustered well - after all, a silhouette score of 1 beats a score of 0. But for cluster 1 the drop-off is also larger, and there are far more points adding to the variation. Something to keep in mind if this all goes sideways.

Optimizing the Number of Clusters 2 & 3: Elbow Diagrams (Distortion and Inertia) for K-Means

An alternate way to find the ideal number of clusters for K-means is the very popular elbow diagram. In this method, you iterate through a list of values for k (number of clusters) as inputs to the K-means algorithm. After running K-means with each k value, the distortion and inertia are calculated:

from https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/

The ideal number of clusters should be at the 'elbow' of the line in the diagram. It is the k value where the distortion/inertia stops dropping as rapidly as k increases. See below for examples.
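A sketch of the inertia version of this method (synthetic blob data stands in for the scaled town demographics):

```python
# Sketch of the elbow method: record K-means inertia (the
# within-cluster sum of squares) for a range of k values and look
# for where the drop levels off. Synthetic blob data stands in for
# the scaled town demographics.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

ks = range(1, 9)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always shrinks as k grows; the "elbow" is the k where each
# additional cluster stops buying much improvement
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```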

Try K-Means

(https://static1.squarespace.com/static/5ff2adbe3fe4fe33db902812/t/6062a083acbfe82c7195b27d/1617076404560/ISLR%2BSeventh%2BPrinting.pdf) Algorithm 10.1 K-Means Clustering

  1. Randomly assign a number, from 1 to K, to each of the observations.

    a. K-means++ initializes the algorithm with cluster centroids that are not fully random. While the result may not be exactly the same every time, the initial cluster centroids are dispersed enough to be useful. Using K-means++ makes the algorithm more deterministic, though not totally. These serve as initial cluster assignments for the observations.

  2. Iterate until the cluster assignments stop changing:

    a. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.

    b. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
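The steps above, as implemented by scikit-learn's KMeans with the k-means++ seeding discussed in step 1a, can be sketched as (synthetic data stands in for the town table):

```python
# Sketch: running scikit-learn's KMeans with k-means++ seeding.
# random_state pins the seeding so reruns match.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=7)

km = KMeans(
    n_clusters=3,
    init="k-means++",  # dispersed, not fully random, starting centroids
    n_init=10,         # keep the best of 10 seedings by inertia
    random_state=0,    # makes the run repeatable
).fit(X)

# Centroids are the per-cluster feature means; they live in the same
# space as the samples but are generally not samples themselves
print(km.cluster_centers_)
print(np.bincount(km.labels_))
```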

Summary: K Means

Essentially, we do get three clusters, and they are interesting.

The parallel coordinates plot above shows a visualization of the clusters based on the data for each town. As you can see, Yellow (2) and Blue (0) are fairly similar, but Pink (1) looks like a combination of outliers and data following a different pattern. DBSCAN, by design, does not assign outliers to clusters. See below.

Machine Learning Algorithm 2: DBSCAN

Given the results and limitations of K-means, I felt it was necessary to try an alternate algorithm with different strengths and weaknesses.

"The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

More formally, we define a core sample as being a sample in the dataset such that there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built by recursively taking a core sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster." from https://scikit-learn.org/stable/modules/clustering.html#dbscan

DBSCAN is deterministic when the data is fed into the algorithm in the same order - results will generally be the same every run unless the user re-sorts the data - making it naturally far more deterministic than k-means without k-means++. Its major benefits are the separation of noise and the use of density to identify clusters, allowing clusters of any shape.

DBSCAN Documentation

Optimizing the Parameters for DBSCAN:

As explained above, the main DBSCAN parameters are eps and min_samples. The work below shows how to optimize the eps parameter. The optimal min_samples was determined by rerunning the cell below with min_samples set to each number between the number of variables and twice the number of variables.
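One common way to pick eps, sketched here with synthetic data, is the k-distance curve: sort each point's distance to its k-th nearest neighbor and look for the knee of the sorted curve.

```python
# Sketch: the k-distance heuristic for choosing DBSCAN's eps. Sort
# each point's distance to its k-th nearest neighbor; the "knee" of
# the sorted curve is a reasonable eps. k is tied to min_samples.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
min_samples = 5

# Each row's last column is the distance to the k-th nearest neighbor
# (the query point itself counts as the first neighbor)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Plotting k_distances and eyeballing the knee is typical; a crude
# stand-in is a high percentile of the curve
eps_guess = float(np.percentile(k_distances, 90))
print(round(eps_guess, 3))
```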

Try DBSCAN:

Additionally, run a silhouette analysis so that after running the DBSCAN algorithm, we can see how useful the clusters are. This time, since we do not need to run through several different numbers of clusters, we can do this once. While I don't visualize the silhouette analysis for the DBSCAN results, one could; I deemed it unnecessary given the low average silhouette coefficient, which indicates a poor fit - the data points are often close to the edges of their clusters. See the silhouette analysis explanation above.
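This step can be sketched as follows (synthetic data; the eps and min_samples values are illustrative, not the tuned ones). Note that noise points (label -1) must be dropped before scoring, since the silhouette assumes every point belongs to a cluster:

```python
# Sketch: run DBSCAN, then score the resulting clusters with the
# mean silhouette coefficient, excluding noise points (label -1).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.6, random_state=42)

labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
mask = labels != -1  # keep only points assigned to a cluster

n_clusters = len(set(labels[mask]))
print("clusters:", n_clusters, "noise points:", int((~mask).sum()))

if n_clusters >= 2:
    # A low average silhouette means points sit near cluster edges
    print("silhouette:", round(silhouette_score(X[mask], labels[mask]), 3))
```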

Actually Visualize the DBSCAN Clusters:

While not terribly useful given that the average silhouette coefficient is so low, this does give us an idea of what the clusters look like. The clusters seem to confirm the concerns the silhouette analysis raised. It looks like the Yellow and Red clusters don't have much differentiating them, and a lot of noise points overlap the main cluster of data.

Why Not More Clustering Algorithms?

At this point, I have two algorithms that are not finding substantial evidence of this data being clustered. Trying more algorithms that may be better optimized for the problem would be useful only if it were determined that clusters exist but are not being correctly assigned. With the current variables, that seems unlikely. I decided not to run any Hierarchical Clustering or Neural Network algorithms in favor of doing a Principal Component Analysis to reduce the dimensions (variables) of the problem.

Try Principal Component Analysis

Since our results have not been substantial with nine dimensions, another way to look for clusters is to build them from the relationships between the dimensions.

"PCA finds which features are most correlated in a dataset, and removes them, leaving you with the most “important” features - i.e., the “principal components.” If you wanted to choose a single feature from this dataset to represent the entire dataset, you would want to select the one that contains the most information. When faced with a gigantic, high-dimensional dataset, you could pick and choose the features you think are important. But this can be inefficient, and even worse, can lead to us missing out on potentially important data! PCA can provide a rigorous way of determining which features are important.

This is the fundamental idea of PCA: select the features that contain the most information. So how do we determine which features contain the most information? We measure variation (or, how different a feature is from another) as a proxy for information.

Why does measuring variation make sense? When two features are correlated, it doesn’t make sense to include them both in a model, because you only need one to capture the information of both. In PCA, we use a similar concept to detect and remove redundant and non-informative features. If a feature contributes very little variation, it can be removed." from https://drive.google.com/file/d/1YdA-HHYP1V05QgvwLCvfnuuau67Zl38n/view (Delta Analytics)

Results from this can be reused in both of the algorithms to see if better results can be achieved.
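A sketch of the dimensionality reduction step (synthetic data stands in for the real nine-variable town table):

```python
# Sketch: reduce the nine demographic dimensions with PCA, keeping
# the components that explain the most variance; the transformed
# data can then be fed back into K-means or DBSCAN.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))  # 100 towns x 9 demographic variables

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)            # keep the top 3 principal components
X_pca = pca.fit_transform(X_scaled)

# Share of total variance each component captures, and the total kept
print(np.round(pca.explained_variance_ratio_, 3))
print(round(float(pca.explained_variance_ratio_.sum()), 3))
```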

https://builtin.com/data-science/step-step-explanation-principal-component-analysis

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Try K Means Again - This time with PCA

Look at Silhouette Analysis for the post PCA data
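This step can be sketched as rerunning the earlier silhouette sweep on the PCA-transformed data (synthetic data stands in for the real post-PCA table):

```python
# Sketch: repeat the K-means silhouette sweep, this time on the
# PCA-transformed data rather than the original nine dimensions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 9))

X_pca = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

for k, s in scores.items():
    print(k, round(s, 3))
```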

In good faith, looking at the silhouette analysis for K-means: you can see from the diagrams that none are very good. Most have at least some negative points, meaning the data was likely put into the wrong cluster. The silhouette analysis has worse results with PCA for K-means than with the original nine dimensions.

Something to note is that while 2 clusters has the highest score, neither elbow diagram would agree with that. Four clusters has the next-highest score, though not a very high one, which matches the inertia elbow diagram.

Lastly, the scatter plot points are plotted by PC1 and PC2. PC3 will be used for plotting later in the analysis.

As you can see above, it looks like the Yellow and Orange clusters are outliers, and Blue and Purple don't necessarily seem like they should be two separate clusters.

Below, you can see what the clusters look like on a map. It doesn't resemble anything we would predict except for Yellow - which is fine, given that we are looking for patterns that have meaning but also differ from what we would predict. The issue is the level of meaning these clusters carry, which is not high.

Confirming what we saw in the 3D scatter plot, there is overlap between Purple and Blue, and while Yellow and Orange are distinct, they are somewhat spread out and include only a minority of the features.

Along with the low silhouette score, the K-means clusters on the post-PCA data do not look particularly helpful: most of the features fall into two clusters that do not necessarily look like they should be separate. Given the Yellow and Orange outliers, let's run DBSCAN again to see what clusters it finds.

Try DBSCAN Again with PCA

That doesn't look so good - we are only getting one cluster that contains most of the points - but the silhouette coefficient looks better than in previous iterations. So: a fine fit, a bad result.

Let's go forward anyway and see what it looks like.

This classification makes sense for DBSCAN in the 3D space, as the yellow cluster is dense whereas the noise (blue) is not dense at all. However, I don't think this result is very helpful for finding patterns within the data: we have one cluster with the majority of the features, plus noise that relates to it only minimally. Since the point of this exercise was to find similarities and differences between county subdivisions based on demographic data, this does not fulfill that goal.

For further exploration at the tract level, see the next notebook.